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INTRODUCTION AND MACHINE CHARACTERISTICS 

The purpose of this report is to describe and justify several approxi- 
mations for the elementary non-rational functions which are, in our 
opinion, particularly suited for the Control Data 1604 computer. The 
manual for the computer has been carefully consulted since some of the 
machine characteristics have a decisive influence on the selection of 
best approximations. The most important of these characteristics are: 
(i) Control Data 1604 is a BINARY COMPUTER 
(ii) 1 word = 48 bits: sign + 47 bits for fixed point 

= exponent (11) + {sign + 56) bits for floating point 
Thus basic round-off * € = 2~^ = 3.55 x lO"''"^ (fixed) 



o 

J = 2~^' = 7.28 X 10^' (floating) 



$ ^, - t~" = -7 TQ ^ tn' ^ 



(iii) Average execution times are about: (in microseconds) 
Fixed point Add 7 multiply 45 divide 65 
Floating Point Add 19 multiply 45 divide 56 

(iv) Size of memory: 32768 words of core, all equally accessible. 
(Most customers will also use tape units) 



Conclusions: 

These machine characteristics will have the following effects on 

subroutines for special functions: 

(i) The basic ranges for such functions as exp (X), log X, 

VjCand X are quite small; further reduction will not 
be necessary, 
(ii) For fixed point subroutines, the truncation error Jl 

■ — — ^— — ^— __ y'V o 

should be smaller or about equal to € : 
3. 55-10' 



X< ,.....-15 



') -14 

though /\^^10 may be acceptable. 

For floating point subroutines, the relative error /\ 
should be at most C j : 



-12 
•10 



Aj ^ 7.28-: 

Internal round-off can be reduced by coding the sub- 
routine internally in fixed point, using (some of) 
the 11 exponent bits, 
(iii) Division time ^3/2 multiplication time; therefore, 
fractional approximations can be used to great 
advantage . 
(iv) Though it is always desirable to make subroutines short, 
this restriction is not quite so serious with a 32,768 
word memory as with 2000 or 4000 words. Even a short 
table of key values may be considered if this helps to 
save time. Extensive tables, however, should in general 
be avoided. 



APPROXIMATIONS FOR THE SQUABE ROOT 

(1.1) Range ; 

For a floating point subroutine, the exponent of X will 
be separated from the mantissa and the two cases, 
"exponent even" and "exponent odd", will be treated 
separately. The latter case is equivalent to 
Vj ^ mantissa ^ |". 

For a fixed point subroutine, the number will be "half- 
normalized" by an even number 2n of left-shifts and the 
two cases are then i ^ ^ • 2 < 1 and J ^ X • 2 < !• 

These two ranges can be treated jointly or separately. 
Separation means better initial approximation and may 
save one iteration, depending on the type of initial 
approximation used and on the accuracy required. 

(1.2) The Newton Iteration Formula 

The theory of rational approximations to the square 

root can be understood best on the basis of the Newton 

Iteration Formula: 

If Y£ is an approximation to Y = •'X, 

then 



^i+i == ^ <^i * ; ) *) <^i'2.i) 

i 
will be a better approximation. Let O r be the "relative" 



*) Throughout this report, the notation (:=) stands for 
"is defined by" (of. ALGOL). 



or logarithmic error of Y. , viz. 

(T^ := m (Y^/Y) (1.2.2> 

The logarithmic error of Y. will then be 



<*^i+l '= ^^ ^^+1^ = 1" ^^°^^ ^0 (1.2.3) 



i+1 • "" ^"i+1 
whence 



This is the Newton Iteration in its standard form. 
It can be improved as follows: If the maximum of 
icTj^l is known (for a given interval [x^, X J ), 

Aj^ = max \^A 

[Xj, xp 

Then the maximum of I o . ^ | can be halved by 

redefining Y^+i := ^i '*" ^^Aj) 
2 V^cosh X. 



so that 



(1.2.4) 



(1.2.5) 



(1.2.6) 



X i^i :=^max \^ .^J = i j^ (cosh A.) (1.2.7) 

It will be noted, however, that this improvement by 
merely a factor 2 requires a true multiplication, while 
the original iteration formula does not, since division 
by 2 can be done by a right shift. Therefore, we shall 
use the original Newton formula for iteration . The 
improved formula, however, immediately leads to the best 
linear approximation. 



(1.3) Best Linear Approximation 

In order to find the best linear approximation, we start 
out with the best constant Yq ^^^ apply one improved 
Newton Iteration. Obviously, 

is the "best" constant approximation to VX for the 
interval [x-j^, X2J , yielding the maximum error 



CI. 3.1) 



X o = max, I In (Y /Y)| 



Xo 

In _£ 



(1.3.2) 



One "improved Newton step" yields 

Y, = Io_l^_ ^^,^^ 
2 /cosh Aq 



with 



a : = 



iajkll 



\/2 [(X2/X;l>^ * (X^/X2)^] 



CXi X,) 



I 
^4 



\/2 [(X2Ai)^ + (Xi/X2)^] 



(1.3.3) 



The relative error of this approximation is: 



A = max I In (Y^/Y) 



In 



CX2/Xj)^ + (X^^/X^) 



(1.3.4) 



or approximately 



In (X^/X^) 



1 2 



(1.3.5) 



It may be desirable (e.g. for scaling reasons) to have 



no error at the ends of the interval, i.e., for X = X and 
X = X^. The solution is: 



Yj = a + bX 



\ /Xi X2 

a := — = 

i/x*i +\/x^ 



b := 



\/x^ + l/x" 



and the maximum error is exactly doubled: 

1 4 



Aj^ = In 



m-m 



^ 2 



In (Xg/Xi) 



8 



(1.36) 



(1.5.7) 



A few numerical values are given below, including the 
errors after 1, 2 and 3 iterations (Standard Newton 
Iterations): 



Initial error 


A. 


XjAi - 4 


Xj/Xj 


= 2 


1. 


X. 


1. 


X, 


5.89 10~2 


2.95 10"^ 


1.49 10 ^ 


7.47 lo"^ 


1 Iteration 


A. 


1.74 10"^ 


4.34 10"'^ 


1.11 10"'^ 


2.79 10"^ 


2 Iterations 


A, 


1.51 10"^ 

■ 


9.40 10"^ 


6,16 10~ 


3.89 10-1° 


3 Iterations 


X4 


1.14 10-^^ 


4.42 10"^^ 


1.89 lO"'^^ 


7.57 lO-^'' 



(1.3.8) 



Ler 



Conclusions : Three iterations are needed after a linear initial approxi- 
mation but the range \^ , ll need not be separated into two smallc 
ones. A ^ is good enough for floating point, A^ is just fine for 
fixed point. 



(1.4) A Simple Fractional Approximation 

The linear approximation Yj = a + bX requires 

1 addition + 1 multiplication, a fractional approximation 

of the form 

b 

Y, := a + 

1 c + X 

requires little more, viz., 2 additions and 1 division. 
This investment will pay off if we can thereby save one 
iteration. This is indeed the case. 



(1.4.1) 



It can easily be shown from the general theory of 
approximation that we can expect our new Yj (X) to equal V X 
at three points rather than two (for the linear case). For 

reasons of scaling, viz., to avoid spill when computing 

-47 
the square root of 1 - 2 , we shall choose 1 as one of 

these points. It can also be shown that the third value 

of X, say X,, where Yj(X) = VX, must be the square of the 

second, if the relative error is to be minimized. 

Therefore, we choose: 



a + 



= /x for 



c + X 

The solution is relatively simple: 
a=l+oC+o<^ 



\ 


S 


1 


h 


= 


0< 


X 

3 


= 


ex 



-[0C+2 0<'^ + 2C<^ + 20<^+ CX^] 



(1 +<X )^ (1 + OC 2).e< 



Oi + OC^ + Oi 



ex. 



(1.4.2) 



(1.4.3) 



For a fixed point square root routine, I recommend the following 
initial approximations: 



(i) UPPER RANGE: 



i ^X <1 







Fitting points: 


^1 = 


1 










^2 = 


ex 










S = 


« 


error: Xj-gi = 


4,0 X 


10-^ 




^1 


= 


42217 
16384 








h 


_ 


12^30502 


29565 








3 43597 


38368 






c. 




, 46 01653 









«= i22 

128 
In order to avoid 
binary round-off 
of the constants 



a 
"4 

b 
16 



'1 20 97152 



1 = +.511644 (octal) 

= -.2411246171475 (octal) 

= +.43067152 (octal) 



(1.4.4) 



(ii) LOWER RANGE : i$X<J 
Pitting Points: X, = J 



^2 = 



error : 



'^rel. " *• 



o<- 



151 
128 



3 X 10 



-"4 



After an easy transformation we obtain 
58513 a2 



^2 " ■*■ 32768 



b - - ^^ 05801 37335 
2 "27 48779 06944 



^ ^ 59 16955 
*^2 83 88608 



+ .711104 (octal) 



— = - .32636267 122734 
(octal) 



= + .26444407 (octal) 



(1.4.5) 



The octal forms have already been scaled for the following 
recomiaended algorithms: 





UPPER RANGE 




LOWER RANGE 


<Yl> .= 


4 (cj 
4" 


A6) 
) + X/4 


2 11 ^1 
2 


.= (£2) ^ (b2/4) 


2 


2 (C2/2) + V2 


X 


Y„ := ^/* . Yi/2 i 
2 Yi/2 1 


/ 




X/2 




All divis 


ion by 2 


and 4 should be executed as unrounded 



right-shifts. No spill should occur, I think, even if 

-»47 --46 

X Ls very close to 1, such asX=l-2 orX=l-2 

Accuracy: The relative errors for Y^ , Y , Y are: 



(1.4.5) 



UPPER RANGE 



X 



4.0-10 



-4 



^ 2 = S.O-IO""^ 

\ "15 

A3 = 3.2'10 '■'' 



maximum absolute error 



Compare: 



LOWER RANGE 



Aj^ = 4.3 .10 



^4 



,-8 



^2 = 9.4-10" 
X, = 4.4-10''^^ 



^ ab... = 3.2 -10 



3 
-15 



0-48 



3.5 -10 



.-15 



(1.4.6) 



Note: I assume that these data are correct and reasonably 
accurate, but I did not have a chance to check 
them as carefully as I should like to. 



10 



(1.5) Fractional Approximation for Floating Point 

Since somewhat less accuracy is required for floating 
point, a simple fractional initial approximation of the 
form (1.4.1) may be used for the "full range", as e.g. 
1^, IJ . It is more convenient for deriving the formulae 
below, to treat the range 

Since there are no scaling difficulties in floating point, 
and since the point X = 1, being in the logarithmic center 
of the interval will be one in which our initial approxi- 
mation will be exact, we can use an exact best-fit 
approximation for this interval, minimizing the relative 
error, i.e. minimizing 

A , := max I Infyy •x)l 
' [h 2]' ^ ^ 

where 



(1.5.1) 



CI. 5. 2) 



Y, := a + 



c + X 



(1.5.3) 



It can be shown that the constants a, b, c can be 
computed as follows: *) 



1 



w := [iv^ (v2-l)J ^^-^ S 6.605 85 4497 

z := \^w2 - w + 1 » / w + 1 ~ 6, 1^3498 4974 

V w + 1 



• 2 * 



3/4 



[/w + 1 + /2z - w + 2 J [z + w - 2 J 

1 + m 

a = c = ■; = 3.09031 5520 

1 - m 



b = 1 - a'' s -8.55005 0013 



.51104 01655 



(1.5.4) 



J 



*) Derivation published in Math. Comp. 19^0 
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Note: While the exact numerical value of m is not 

critical (it should rather be chosen a trifle 
too big, but not smaller than the exact value), 
the relation b = 1 - a^ should be exactly 
fulfilled. I recommend, therefore, to truncate 

m to, say, 17 binaries (rounding up) and then 

2 
to compute b = 1 - a (by machine). 

Error bounds : The maximum relative error of the approximation 
(1.5.3) - (1.5.4) is 



whence 



X J = 2.52614-10 
^2 = 3.19-10"^ 
X, = 5.09-10-^2 



-37 -12 

2 = 7.28-10 



-3 



Compare: 

To shift the range from [l, 2] to |^>P/2, 2>fJ , 

where M is an arbitrary number, take 



(1.5.5) 



a = 



1 -t- m 

1 - m 

1 -*■ m 

1 - m 



&^5 



■¥f 



b = 



[' -{f^)> 



3/2 



i^ 



for :^ ^ X ^ 2^ 



(1.5.6) 



In particular, if 



1 + m 

and b, 

1 - m ^ 



-a^ + 1 



(with m = .51104 01655) 



(1.5.7) 



then 



a^.2 
ai-2 



bj-2 



n 

2n 

5n 



(1.5.7) 



2n-l^ . 2n + 1 
for 2 ^ X ^ 2 
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2. APPROXIMATIONS FOR EXP (X) 

All approximations In this section are based on the well- 
known continued fraction 

e^=l.^. Xil^ XU, Xil, Xij, XU,. . . 
I2-X 16 I 10 I 14 I 18 f 22 

which can also be written in the form 

X , 2X S(X^ ) + X 

e = 1 + ~ = (2 2) 

S(X >- X S(X2).- X 
where 



S(X^) = X cth - = 2 + 



I 6 I 10 I 14 



The first four approximants to S(X ) are: 



So 


:= 2 








Si 


:= 2 + 


X2 
6 






S2 


:= 2 + 


X2 


12 




6.X2 


600 






10 




60 + X2 


s , 


:= 2 + 


X2 




= 2 + X^ 


^3 


6 + 

10 


X2 
X2 








14 





(2.1) 



(2.3) 

(2.4.0) 
(2.4.1) 

(2.4.2) 



{ .05 + : —] (2.4.3) 

\ 42 + x2 / 



The last expressions in (2.4.2) and (2.4.3) are the forms which 
can be evaluated most quickly. 

Range: It is well known that for a binary machine in floating 
point operation, the range of X can easily be reduced to 

|x| ^ i In 2 (2.5) 

(cf. e.g. (2.10) below) 
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If we approximate e by 

SjCX^) _ X 



(2.6) 



then the maximum relative error will be 



X3 :=| m [e-^ - R3(X)j| = 



2.8 X 10' 



-12 



(2.7) 



for I X I = I In 2 



While this is just good enough for a floating point exponential 
subroutine it may be worth while to note that for the same 



range the best-fit approximation R, , viz., 

,*,„, S3(X^) + X _ , ^ 2X 

:= J = 1 + 

S*(X2) _ X S*(x2) _ X 



R3(X) 



with 



and 



S*(X^) := a -t- X f b + 



( ' * -T^^) 



a = 2.00000 00000 00575 924 
b = .04996 24891 36450 764 
c = 4.90315 47989 68682 648 
d = 42.01353 28950 41661 680 
reduces the error by a factor close to 256, thus: 



max I In [ e"^ R* (X)] j S l-H x 10 
for I X I ^ i In 2 



(2.8) 



(2.9) 
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Note: The above-mentioned reduction of the range to 

X ^ 5 In 2 is achieved as follows: To compute 
e^, find the integer n so that 

u = nln2 + X, |x|$| In 2 

thus 

„u _ „n X 
e = 2 e 

This n is simply added to the exponent of the result. 



(2.10) 



The only practical way to find n (for a general purpose 
subroutine) is to multiply u by (1/ln 2) and then to 
determine the nearest integer. If [ ] denotes "integer 
part of" this can be written as follows: 

V In 2^ 

n := [z * I] 

w := z - n 

X := w • In 2 
Comments : The multiplication (2.11) is due to the fact that 
we want e , while the machine has base 2, For many applications 
base 2 is just as good; for example, if the logarithmic and 
exponential subroutines are used to compute odd (or high) 
powers such as X or X , i.e. in all those cases where 

logarithm and "antilogarithm" are used as auxiliary functions, 

X 
just like Icgjo and 10 are often used for numerical computations 

without an automatic computer. 



(2.11) 

(2.12) 
(2.13) 
(2.14) 
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1 therefore, recommend that the basic subroutine computes 2 and 
that the division by In 2 (multiplication by 1/ln 2) is executed 
outside, if necessary, or is done automatically as an option 
(Separate entry to basically the same subroutine). 

The multiplication (2.14) can also be avoided since 



X _ wln2 /vT/- ^ 
= e = e <r R(w) 



if 



with 



R(w) := 



2w 



— 2 

S(w ) + w 



S(w^) - w S(w^) - w 



■«-l 



-2 — 2 - c 

S(w ) := a + w (b + — !■ ?) 

d + w 



(2.15) 



a := a/In 2 

b := b-ln 2 

c := c/ln 2 

d := d/(ln 2) 

The numerical values of a, b, c, d are the same as in (2.8), but 
those of a, b, c, d have not yet been computed (on a normal desk 
computer, double precision is mandatory; on the CDC 1604 full 
fixed point precision will just be sufficient, at least for 
b, c, d.) This reduces the number of multiplications (M) and 
divisions (D) to 2M + 2D for 2^ and 3M + 2D for e". The entire 



(2.16) 



subroutine will take around 400 Msec. 
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3. APPROXIMATIONS FOR THE LOGARITHMIC FUNCTION 

Note: Just as a subroutine for 2^ is somewhat shorter and 
simpler than one for e , the same will be true for 
log- X as compared to the natural logarithm In X. 
Therefore, we shall assume here that the subroutine 
proper will compute log^ X . The final multiplication 

In X = (In 2) log^ X (3) 

can be done outside the subroutine or it will be 
"an optional extra at additional cost". 
(3.1) Reduction of the Range 

If X is represented in the machine as a normalized 
floating-point number, then the integer part of the logar- 
ithm will be the exponent of X minus one; or if 

X= ? 2", I $1 < 1 (3.1.1) 

then 

Y := log2 X = n + m, m := log^ 1? (3.1.2) 

This reduces the range for which log- 5 must be computed 
to the interval |_ 5 , 1 J . 

It is well-known that, for any given range of the argu- 
ment of the logarithmic function, the series 

t2 t^ t^ . 
2t (1 + — + — + — + . . .) (3.1.3) 
5 5 7 



(3.1.4) 



m? 


= In 


1 
1 


+ t 
- t 


= 2t 


converges 


much 


better 


than 



« , . u2 u5 u^ 
In J =ln(l + u) = u ■>■ — + 

2 3 4 
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The same is true for the corresponding continued fractions 
and for best-fit approximations of polynomial or 
fractional form. The maximum absolute value of t can 
further be reduced for the interval 



mm 
if we put 

t := 






max 



so that log 



= log 



\* 



1 + t 
log-— 



The range of t is then given by 

5 max ~ 5r 



t 1 < t„jax 



Smax 5 o 

If we apply this method to the range 2^5 $ 1 
(cf. (3.1.1) above), we obtain 

5 =\/| = .70710 67811 86547 5244 



log-, 5 = 0.5 + 



^%2 



max 



log 



1 ■<• t 
1 - t 



^- 1 «# 

-== ~ .17157 28752 5381 

1/2 + 1 



1 -2 

Thus t is just a little less than 3-10 and each 

term of the power series (3.1.3) will add almost 2 more 
decimals; or a little more than 2 if the corresponding 
best-fit polynomial is used. 



(3.1.5) 



(3.1.6) 



(3.1.7) 



(3.1.8) 
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While the reduction (3.1.8) will be sufficient to allow 
for fairly short rational approximatins , it is worth- 
while to note that a further reduction is possible without 
introducing any new explicit operations (i.e. other than 
those for determining the proper range). For if we divide 
the interval [^^in , Smax] ^-"^0 n subintervals: 

k interval = I 5, i , 1 a , 

L-'k-l.k > ""k, k+lj 

5.i„- §,,,< 5, <■••<! , <5 -f (3.1.,, 

^'"^ n-l,n n,n+l max 

and define 5 :=|/S • 5 

k k-l,fc k,k+l 

then for each 5 € ^ P 

L=»k-l,k' ^k,k+lj 

f _ ?, (3.1.10) 



we take 



t := 



and hence have ? ? 

|t|^ t^ := ^k -^k,k-l 

?k ?k,.k-l 



= 7—7. '^k --= V^^* 1. k/?k, k- 1 



or t. = 

If the number of subintervals, n, is given then maximum 

1' *2' •••■ ^n^ ^s minimized by logarithmic sub- 
division, viz. 

^1 = r2 = . .. = r^ (3.1.11) 

but 1 inear subdivision , viz. 

^k, k + 1 = ^min ^ K (^max " ^min) (3.1.12) 

may save enough time and/or storage to offset its 
lesser theoretical efficiency. 
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EXAMPLES : 



(i) Logarithmic Subdivision, min = |, max = 1, n - 5 



? 






Yes 



0< := 5/6 



yJ:= (i>^/^ 



0< z- i 



A := Ci)^ 



<X:= 1/6 



^ := (4)'/^ 






log, f := R(t) - OC 



*^max = -05769 81098 



*) 



(ii) Linear Subdivision, min = 21 '"a^ = 1, n = 4 



0-*| k := [8f ] - 4 
' 



[]=" 



integer part of" 




f o =/5/16 = y/.3125 

Cj = 11/16; f^ = 1V16; f^ = 15/16 



*) 

t, 



'2 -'k 
1/5/4 -1 



= .05572 80900 



In these two examples, t is not much different, though 

max 

(ii) has n = 4, (i) only n = 3. 



'^^■'— ' is a suitable rational approximation. 



t) R(t) -iCi log,i :: 

*1 - t 
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(3.2) Rational Approximations for In 



1 ■► t 
1 - t 



(i) Approximations of the form 



In 



1 + t 



R(t) := t(a + 



-fl^^ 



*) 



1 - t ■ ' ■ c 
This simple approximation is not accurate enough unless 

[ 2> ij is further divided. Approximate maximum errors 

(absolute errors) are given below: 



TYPE OF SUBDIV. 


n 


f k,k+l 


X 


max ^ 

f k-l,k 


none 


1 


2 


3.29 


X 


10-^ 


logarithmic 


2 


2 


2.61 


X 


10-^^ 


logarithmic 


3 


2 1/3 


1.53 


X 


10-12 


linear 


4 


1.25 


1.20 


X 


-12 
10 


logarithmic 


4 


2 1/4 


2.04 


X 


10-^^ 


1 inear 


8 


1.125 


1.13 


X 


-14 
10 


logarithmic 


8 


2 1/8 


1.60 


X 


10-1^ 



(3.2.1) 



TABLE 
(3.2.2) 



For a standard floating point subroutine the maximum error 
should be below 7- 10" ^ , thus "logarithmic, n = 3" and 
"linear, n = 4" are suitable. For many special purposes, 
a routine which gives the logarithm in fixed point will 
also be very useful. The argument may be in fixed or 
floating representation, and a somewhat greater accuracy 
may then be required. We presently have computed the 
following coefficients of approximation. 



1 + t 



) For log2 -, divide the constants marked with an asterisk by 

In 2 = .69314 71805 59945 30941 72321 (cf. Tables Nat. Log. Vol. II, 

N.B.S. Appl. Math. Series #53) 
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max 



( f k-l,k) 



max 



1/2 



,1/3 



,1/8 



.17157 2875 



.08642 7234 



.05769 8110 *) 



.02165 7463 



a*,b*,c 



a* = .89554 02099 560 
b* = -1.82984 55434 565 
c = -1.65677 85798 852 

a* = .89055 57990 96268 
b* = 1.84630 58864 29456 
c = 1.66417 19172 70150 

a* = .88963 00669 363587 
b* = -1.84938 33168 136899 
c = -1.66555 60110 514012 

a* = .88899 31487 3553390 
b* = -1.85150 43597 8820645 
c = -1.66651 02989 0598803 



X 



3.29 10 



2.61 10 



-11 



1.53 10 



-12 



1.6 10 



■15 



For very small ranges, up to about ^max / §niin ~ l-^' 
i.e., tjuax ^-05, the "telescoping procedure for continued 
fractions"^) can be used to compute a*, b* c with 
sufficient accuracy. For this particular case we obtain, 
with € = t„,^: 



In 



1 + t 



Cat LQ. 



Po + Pi t 



* 4.2 



% ^ 11 t' 



* „ 3 6 „ 
P = 30 + 7-jT t ; P, 



max 



-8 



18 2 
5 max 



15; 



a = -9 - 2 t^ + ^ t"^ 
^1 5 max ]^o max 



TABLE 
(3.2.3) 



(5.2.4) 



from which the corresponding expression of the form (3.2.1) 
can easily be derived. 



#) cf. copy of my paper on this subject: "Methods for Fitting Rational 
Approximations", J.A.C.M. , 1960. 

*) May also be used for n = 4, linear, as long as the exact values 

X-12 
= 1.45-10 ) are not known. 
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(ii) Approximations of the Form 

In — ^ ^ R(t) := t fa* + t^ (b* + 
1 - t I 



+ t^ •" 



O 



(3.2.5) 



This approximation yields nearly three decimals more than 
(3.2.1) - for the ranges given below, but it also requires 
one additional constant, one more addition and one more 
multiplication. This may not be too high a price if the 
subdivision of the interval [2, 1 J can thereby be avoided; 
however, the error for the full range is approximately 10 
which is slightly above the basic round-off of 7-10" . 
Further error estimates are given below: 



11 



TYPE OF SUBDIV. 


n 


""" f k- 


k+1 
1, k 


X 

(appr. ) 


none 


1 


2 




-11 
10 


logarithmic 


2 


1 
2^ 




2 . 10-1* 


logarithmic 


3 


2^/^ 




-16 
5 • 10 


1 inear 


4 


1.25 




4 . 10-1^ 


logarithmic 


4 


1 

24 




-17 
4 • 10 



The constants a*, b*, c* and d for the full range 
have been computed for this report: 

a 



5 /? 

* max / *i 



= 2 



mm 

"^ max = .17157288 
maximum abs. error 



a* = 1.99999 99994 91255 
b* = .10907 88905 02997 
c* = - .77731 40010 05492 
d = - 1.39406 51451 76107 



X = 



1-10 



-11 



TABLE 
(3.2.6) 



TABLE 
(3.2.7) 



1 + t 



O For log^ divide the constants marked with an asterisk 

^ 1 _ t 

by In 2 = .69314 71805 59945 30941 72321 ... 
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(iii) Approximations of the Form 

m Ll-^ fti R(t) = 1 ^r— *) (3.2.8) 

1 - t a'+ t^(b'+ ^— ;-) 

d + t^ 

This approximation is only slightly more accurate, 
the error being about 64% of that of the previous 
approximation, (3.2.5). The number of constants is 
the same, but one multiplication has been replaced by 
a division. Furthermore, this form (3.2.8) is some- 
what more susceptible to round-off errors (due to the 
finite word-length) during evaluation than the form 
(3.2.5). However, this need not bother us if the 
evaluation is done in fixed point (carefully scaled) 
with correct round-off to a floating point mantissa 
at the end. 



For the full range [i, l] the maximum absolute 

-12 
error will be approximately 6.5rl0 (Note: The 

constants a', b', c' and d are not included in this 
report.) 

For the half ranges, [i, \/i]and [/F. ij .the 

-14 
maximum absolute error will be approx. 1.3-10 , 

but this approximation cannot be recommended for a 
full precision fixed-point routine since the round- 
off error will be bigger than in (3.2.5), and (3.2.9) 
will be faster. 



*) For log2 ^ * ^ , multiply the constants a', b, c by 
1 - t 
m 2 = .69314 71805 59945 30941 72321.----. 
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( iv) Approximation of the Form 



In i_!_l2;R(t) = t r a* + — 

1 - t L C + t^ 4 



.2 J 



*) 



(3.2.9) 



e + t'' 

This approximation is much more accurate than (3.2.8); 
it requires one more constant but only one more addition; 
the number of multiplications (1) and divisions (2) is 
the same. This approximation can be used for a full- 
precision fixed-point subroutine without subdividing the 
interval I §, ij. The constants are: 
for 

?... /f 



' /< 

max / " 



mi.n 



^i *max 



.17157288 



, A = 1. 



18 10 



-14 



a* = .57314 62238 34578 TABLE 

b* = -3.83907 86035 23797 (3.2.10) 

c = -3.08667 66195 74836 

d = - .61016 05452 67418 

e = -1.54047 22733 27729 



Which of the approximations given above are most suitable 
for Control Data subroutines will depend not only on 
certain details of coding and machine characteristics, but 
also on the relative merits of saving time or storage space 
and on the range for which a fixed point logarithm may be 
used, i.e., on the number of bits available for the 
fractional part of a fixed point logarithm after suitable 
scaling. 



1 + t 
*) For log^ -: r divide the constants marked with an asterisk 



=2 1 - 
by In 2 = .69314 71805 59945 30941 72321 
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APPROXIMATIONS FOR THE ARCTAN Z 



(4.1) Reduction of the Range 



The addition theorem for tan ( ^ + V), viz. 

, A.J. „i X tan ^ + tan fif 

tan ( 0* ^ ) = 

1 - tan0 tan^t 

can be used to reduce the range for which arctan X or 

arctan (X/Y) must be computed, if we store, in the 

memory of the machine, a table of "key values" z. , ^^, 

\ := tan 1/*^. 

^k = arctan 2^ 
The subroutine will first find, for each argument Z , 
the table entry Z^ which is "nearest" to z in the 
sense that ) arctan Z-^'j^jis minimized. After that the 
algorithm runs as follows: 



t : = 



Zk 



1 + zz. 



*) 



C = tan p ) 



^ := arctan t (approximation) 
arctan Z := '/'j, + 

While this algorithm permits a very drastic reduction of 
the range its cost in time and storage is considerable. 
Time is lost for finding the best Of^,Z^) , and for 
computing t (requiring one each of the four basic 
operations). 



(4.1.1) 



(4.1.2) 



(4.1.3) 



*) To compute arctan X/Y, use 
X - YZk 



t := 



Y + X2i 



(4.1.4) 
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Storage space is needed at least for the Z,, while the 

^ may be equidistant. 

Conclusion ; With a small table , too much time is lost 

and not enough is gained by the moderate reduction of 

the range; a large table will save some time but cost 

more memory space than desirable for a general purpose 

subroutine. 

Special Values ; The formula for t is, of course, very 

simple for Zj, = (t = Z) and for Zj^-^ OO (t = 1), but 

Cm 

also for Zj^ = - lj(t = (Z + 1)/(1 - Z)). With these 
four values, we obtain the following short table: 



RANGE OF Z 


2k 


• v^k 


t 


Z < - (1 + /I) 


-«0 


- ir/2 


1 

~ z 


-(1 + \I2)<Z< - C/2- 1) 


-1 


- "[f/k 


1 + z 

1 - Z 


jzK/T- 1 








Z 


\f2-l < Z< 1 + y/l 


+1 


+ ir/4 


Z - 1 
Z + 1 


z> 1 + v/? 


+ 00 


+ 'Tr/2 


1 

Z 



*) 



(4.1.5) 



The first and last ranges can be joined if it 
is not required that 

- "^/Z < arctan Z < + IT/2 
This method reduces the range to | t|$ v2 - 1 



*) It is true that t can also be computed very easily for arctan X/Y, since, 

1Y1+Z = Y + X ^Z-1 X-Y 
. if Z = XA, 7=y. and = 



1-Z Y-X Z+1 X+Y 

but how can we determine th^ range directly from X & Y, without 
computing Z? 
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(4.2) APPROXIMATIONS FOR ARCTAN t, |t|^ /T - 1 

We have derived the following approximation from the 

continued fraction for arctangent of t 

R, (t) as arctan t if 
o 



R (t) := t [ di + _ -21. 

L t2 + d, -H 



t^ + d, + 



t2 + d. 



(A. 2.1) 



with 



d = 0.20131 20564 40625 303 

ei = 3.11385 00604 57103 14 

d2 = 5.40622 85377 62366 96 

e.2 = -3.92831 57487 32049 88 

dj = 2.71829 04240 10983 87 

ej = -0.15058 39379 13062 15 

d/. = 1.33875 95795 46815 11 



The maximum relative error 



X. 



(4.2.2) 



/\ , : = max 
' |t|$v/2-l 



iog3i_i^ 



arctan t 



= 2.84-10 



-14 



(4.2.3) 



is much smaller than necessary for a good floating point 

routine; it is actually good enough for a fixed point 

routine, since the absolute error, A. , ^3-2" 
' 6 abs 



X 



^ . = max R^ (t) - arctan t = 1.11-10 
6 abs , _ 6 ^ 

t$\/2-l ' 



(4.2.4) 
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For a floating point routine the somewhat simpler approxi- 
mation Re (t) may be used: 



R3 (t) := t djj + t^ (d^ + 



t^ + dj + - =. 



-) 



(A 



-12 
3.9-10 ) 



t^ + d. 



where 



d = 
o 



di = 



«1 = 



d2 = 



d3 = 



0.99999 99999 96107 
-0.01558 53710 18178 
-0.58531 51350 .71831 

2.10055 40871 65198 
-0.41900 30022 82544 

1.62102 58336 34443 



(4.2.5) 
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APPROXIMATIONS FOR TAN X 



(5.1) REDUCTION OF THE RANGE 



The basic range for the tangent is given by the periodicity 
of tan X. 

tan (X i IT ) 2 tan X (5.1.1) 

Therefore, the basic range is 

« - ir/2 ^ X ^ 0< + 1T/2 (5.1.2) 

where 0< may be chosen to be zero. 
Further reductions can easily be achieved by the 
relation (4.1.1) of which 

tan ( ^ t Tr/2) = - 1/tan ^ (5.1.3) 

is a special case. Another important special case is 

t// = + tr/4, tan (/'= - 1, thus 

tan(^ t tr/4) = ^^!! f " ]|. (5.1.4) 

1 + tan 

(5.1.5) cuts the basic range to | X |^ 17/4 ; with 
(5.1.4), we get down to 1 X 1 ^ TT/S *) 

A code for the reduction of the range can be made 

efficient only if we introduce an auxiliary variable W = — • 2 . 

We have arbitrarily chosen k = 2, thus 
4 

ir w 

Let us write t (W) := tan -^V" = tan X (5.1.6) 

4 

then 

t(W i 4) = t(W) (5,1.7) 

t(W - 2) = -l/t(W) (5.1.8) 

t(wi 1) = ^^^^ ~ ^ (5 1 9) 

^ ^ 1 * t(W) (5.1.9) 

*) By help of a table of key values and using (4.1.1), the range can be 
further reduced, saving a little time at great cost in storage. 



W :.= - X (5.1.5) 
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(5.2) BASIC FORM OF RATIONAL APPROXIMATION FOR TAN X 

Since tan X and t(W) are odd functions, we may write 

t(W) = W-T (W2) = -~- (5.2.1) 

S(W'^) 

9 2 
The question is whether T(W'^) or S(W ) should be used as 

an auxiliary function. Assuming that T and S could be 
computed equally fast (with comparable accuracy), the first 
form will be faster for the basic range since a multiplica - 
tion takes less time than a division. However, if one of 
the relations (5.1.8) or (5.1.9) must be used, then the 
second form is much faster since 

t(W i 2) = - J— = — L = 1. (5.2.2) 

t(W) W-T W 
and 

, + , , t(w) - 1 _ WT - 1 wis 

t(W - 1) = = = (5.2.3) 

1 + t(W) 1 + WT S + W 

Which method will be faster on the average? This depends 
on the frequency of arguments being in the basic range 
( |x|^ •'/4 or 'vS) as opposed to those in the ranges 
requiring reduction by (5.2.2) or (5.2.3). If the distri- 
bution is uniform, then the second form is faster. *) The use 

2 

of S(W ) also reduces the maximum time for the subroutine. 

2 
Therefore, S(W ) will be reconimended for a general purpose 

subroutine. 



*) While this is true for both ranges (|x|$ IT/4, fx)^ Tf/S), 

the difference is bigger if the smaller range, and thus (5.2.3), 
is used. 
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(5.3) RATIONAL APPROXIMATIONS FOR S(W ) 



Since 



tan X 



^-rf-rf-rf-rf- 



(5.3.1) 



S(W ) can be expressed by the continued fraction 

S(w2) = = wAani!^ 

t(W) ^ 4 



\^ 



1T\2 
-16 " 



ir 2 1 

•16 W^l . 



4 
^ ' 3 I 5 I 7 

In the formulae given below, this expression has been 
modified in two ways: 

(i) The coefficients have been modified for best fit for the 
respective interval, i.e. so that 



(5.3.2) 



X 



rel. 



= max 



S* (W^) 

In -° T, 

S(W^) 



(5.3.3) 



( = the relative error) is minimized, 
(ii) Simple algebraic transformations have been used to minimize 
the time required for evaluating the respective expression 
on the Control Data 1604. 
If n is the "degree" of the approximation, viz. 



S* (W2) = C^ + 1=^ H. . . - * pQ^ 

We obtain the following short table of maximum relative 
errors ( '^rel.^- 



(5.3.4) 



n 


1 x| i ITa 


|x|$ rr/8 


3 
4 
5 


-8 
1.42-10 

2.21-10"'"^ 

-14 
2.38-10 


4.69-10"-'^^ 
1.83-10~-'^^ 
4.92-10-^^ 



TABLE 
(5.3.5) 
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(i) APPROXIMATIONS FOR |x { ^ fT/U 

REDUCTION OF THE RANGE: 
2 



X (- 



) =:2i + k + V 



TT 


jk + vj $1 


where i and k are integers and 


. |v| ^ i _ 


for k = t 1; tan X := - S/W 




for k = ; tan X := W/S 





ALGOR 
(5.5.6) 



where either 



S := S (v^) = a + 
4 



( X= 2.21-10- -^S 



*) 



c + v^ + 



e + V 



or S := S3 (v2) = a + v^ J b + 



a = 9.45815 57617 25496 

b = 290.32031 00841 78635 

c = -37.33612 85498 26952 

d = -20.54475 60663 69045 

e = -4.64212 22417 14098 

" c 



d + v-^ + 



+ V 



2 J 



X 



2.38 X 10 



-14 



a = 
b = 
c = 



.63661 97723 67596 
-.07531 94869 91705 
3.88560 57227 68290 
d = -14.87026 86251 97861 
e = -57.81869 13873 68667 
f = -9.32191 89536 46030 



(5.3.7) 



(5.3.8) 



*) About 4 to 5 bits (li decimals) may be lost by amplified round-off 
with this slightly unstable approximation. It is recommended to 
evaluate SJ in fixed point (48 bits) with proper scaling. 
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(i 



ii) APPROXIMATIONS FOr|x| ^ TT/S 



REDUCTION OF THE RANGE: 
X(-)=: 4i + k+W 

where i and k are integers and 

+ 



|k + W 1^2 

|w| $; 



for k = - 2 ; 

for k = + 1 ; 

for k = - 1 ; 

for k = ; 



tan X := - SA? 

W + S 



tan X : = 



tan X r= 



S - W 

w - s 



s + w 

tan X := W/S 



* 2 2 c 

Where either: S := S, (W ) = a + W (b + T~rT7) 

3 d +- W 



('\ 



With 



4.69-10~'^''^) 



a = 1.27323 95447 94842 
b = - .07881 15321 78328 
c = 3.11023 62587 99796 
d = -16.99695 38195 49826 



Or 



S : 



= s! (W^) 



a + 



*) 



c + W + 



With 



CA = i.r- 



.83-10 ) 



e + W^ 
a = 19.05374 90250 96548 
b = 2368.80216 75614 4904 
c = -151.08047 87151 32145 
d = -331.56482 92387 31320 
e = - 18.56899 78562 14913 



ALGOR. 
(5.3.9) 



(5.3.10) 



(5.3.11) 



*) About 4 to 5 bits (ij decimals) may be lost by amplified round-off 
with this slightly unstable approximation. It is recommended to 
evaluate S* in fixed point (48 bits) with proper scaling. 
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6. APPROXIMATIONS FOR SIN X AND COS X 

(6.1) REDUCTION OF THE RANGE 

The first and basic reduction of the range of the independent 

variable is brought about by the periodicity: 

sin X = sin (X + 2k TT ) , _ + , + , + , (6.1.1) 

cos X = cos (X + 2k -TT ) "^ " " ^' ^, - ^, • • - 



and the basic relations 

sin (X i TT ) = _ sin X 

+ ^ i' (6.1.2) 

cos (X - 'IT ) = - cos X 



] 



These reduce the range for which rational approximations for 

sin X and cos X must be found to 

- ^/2 ^ X ^ + IT/2 (6.1.3) 

Further reductions are possible, but most of them do not pay: 

(i) Since sin (-X> = - sin X, cos (-X) = cos X, we could reduce 
the range to ^ X ^ 'n*/2. But since this range is not 
sjrmmetric, additional terms would appear in a best-fit 
approximation (viz. even powers for sin X, odd powers fcr 
cos X). Nothing would be gained, neither in speed nor in 
storage space, 

(ii) Formulae such as cos X = 2 cos (X/2) - 1 or 

X p X 

sin X = sin - (3-4 sin'^ -), etc. could also be used, 

but the evaluation of these formulae takes more time than 

can be saved by the respective reduction of the range, except 

for extremely high precision (double or triple precision). 
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(iii) Since the continued fraction and rational approximations 

for tan X are so good, we may be tempted to compute sin X and 



om 


2t 


X 

:= tan - 


sill X - -) 
1 + t^ 

l-t2 




1 + t2 



(6.1.4) 



Here again, the evaluation of 2t/(l + t ) or 

2 2 
(1 - t'^) / (1 + t ) takes more time than can be saved. 

(iv) If we store a table of key values Sj^ = sin Xj^ and 

Cj^ = cos Xj^, then sin X and cos X may be computed from 
sin X = Sir cos 5 + Ci, sin 5 



X = X. + ? 



cos X = C]j cos 8 - S, sin 5 



(6.1.5) 



This requires two additional multiplications and a double 
table look up; in addition, both cos5 and sin* must 
be approximated . This method does not pay off unless the 
range of 5 is made extremely small — but this means a 
long table of many key values (perhaps 1000 or so), 
(v) So far, we have not yet made use of the relation 

sin X = cos ( "^/l - X) (6.1.6) 

This can be used to obviate the need for one of the 
subroutines, *) but 



*) Since sin X should come out as zero for X = 0, and good relative 

accuracy is desirable even if fx) « 1, for sin X, the cos X 

subroutine may be dropped, but the sin X - subroutine should be 
retained. 
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it may also be used to reduce the range by taking "the 

other function" if I x|> X 

where <C X ^ '^/2 

o 

may be determined in such a way that the maximum error of two 
given forms of best-fit approximations for sin X & cos X 
are just equally big. 




GRAPH 
(6.1.7) 



GRAPH 
(6.1.8) 



The sketch above illustrates this situation, assuming that the 

relative error has been minimized. 

This is the only reduction as far as 1 can see, which is worth 

while. 
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(6.2) MATCHED RATIONAL APPROXIMATIONS . FOR 



SIN X, 



< X & COS X, |X|< TTll - Xq 



(i) A pair of approximations of the form 

So 



sin X := X (S^ + 



X' + S3 + 



-S4 



X^ + C, 



cos X := Cj^ + 



C2 



X'^ + C3 + 



C4 



x^ + c. 



1-11 



will yield an error of approximately 2 x 10"^^ *) 
The matching point is approximately XQi^.90. Since 
2 X 10"^^ is too big even for (full precision) floating 
point , the coefficients Sj^ through C5 have not been 
computed yet. They can be furnished on request, 
(ii) A slightly better pair of approximations is 



sm 



X := X (Sj^ + X2 (S2 + x2 (S3 + 



cos X := C]^ + X^ (C2 + X^ (C3 + 



Yi^ + St 



C4 
X^ + C( 



))) 



)) 
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(6.2.1) 



(6.2.2) 



^ =8x10 i^l.lx2 

which may be just acceptable for a 1604 floating point 
subroutine. The matching point is between .88 and .89. The 
coefficients can be computed fairly easily by a relatively 
simple iteration and will be furnished if requested. 
NOTE: The numerical stability of (6.22) is much better 
than that of (6.2.1). (6.2.2) can be evaluated in floating 
point. 



*) All errors quoted are relative, i.e. 



A- 



max 



In 



APPROX. 
FUNCTION 
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(iii) The following pair of approximations will yield 12 
significant digits: 



sin X 



:= x(si 



X2 + S3 + ^^ 



) 



X^ + S5 



cos X := Ci 






X + Cf 



(6.2.3) 



relative error: A = 4. 56 -10 
matching point: X^ = .6271 



■ 13 



COEFFI- 
CIENTS: 
*) 



Sl = 


7.23084 


68962 


44279 


^1 = 


.99999 


99999 


99545 


S2 = 


814,80758 


58531 


22316 


C2 = 


- 1.67714 


58152 


33633 


83 = 


55.40962 


23983 


32114 


^3 = 


271.20667 


68780 


46237 


S4 = 


1262.62414 


34758 


4584 


C4 = 


80.85518 


72908 


64723 


% = 


16.75449 


20850 


08428 


C5 = 


2442.54254 


69501 


6347 










^6 = 


16.33389 


75777 


12791 



(iv) Sincejsin x| ^ l,|cos xj ^ 1, a fixed point sin X 

cos X subroutine is also feasible, approximately 14 digits 
will be needed for full fixed point accuracy. The following 
pair of approximations may be used: 



sin X := X 



[si * x' ( 



2 „2 



S3 



X^ + S^ + ^5 



X^ + Sg 



cos X := Cn + X' 



^C. 



x^ . c, * _ 



5 

-) 



x^ + c. 



(6.2.4) 



A = 9.5-10 



15 



Xq = .8798 



*) cf. "NOTE" on p. 42 



The coefficients for (6.2.4) are: 
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Sj = .99999 99999 99990 520 



52 = - ,32590 23686 32526 89 

53 = 71.63509 21318 01290 7 
84= 113.84748 44349 29748 
S5 = 4600.47709 02433 9799 

Sg = 13.69104 85436 09095 3 



Cj = .99999 99999 99990 455 



C2 = 1.70422 86567 42058 33 
Cj = 276.53438 35490 61398 
C^ = 81.38352 57153 53261 8 
C5 = 2456.97667 28922 5259 
Cg = 16.57290 96384 87355 5 



The numerical stability of the formulae (6.2.4) is fair. Even 
with careful scaling the round-off error may be bigger than the 
truncation error. The next approximation will reduce both at a 
moderate increase in computing time. 

(v) The following pair looks fine for a high-precision fixed-point 
subroutine: 



sin X := X (S, + X^ (S- + X^ (S, + X^ (S, + — 

1 2 3 ^ x2 + 



-)))) 



X^ + S, 



cos X := Cj + X^ (C^ + X^ (C3 + X^ (C^ + !l§ ))) 



X^ + C. 



(6.2.5) 



X = 



-15 
8.1 X 10 , X « .885 
o 



Since sin X and cos X must obviously be scaled by a factor 
of 1/2 in order to avoid overflow, ^^^i = 8.1 x 10~ means 
that the truncation error is only slightly in excess of 1 
unit of the last binary. Since (6.2.5) is very stable, the 
sum of round-off plus truncation error should be less than 
2 or 3 units of the last binary for most values of X. 
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(6.3) RATIONAL APPROXIMATIONS FOR SIN X, j X j 4 '^/2 

Approximations to sin X for the full range, - Tt /2 ^ X ^ + t7*/2 
require one division or multiplication more than those in (6.2). 
The coefficients have not yet been computed, except for the first 
approximation, the accuracy of which is not sufficient for a full- 
precision floating point subroutine. The error estimates for the 
other approximations are believed to be correct within about 
- 10% of the values given below. ( Please ask for coefficients 
if interested.) 



(i) APPROXIMATION OF THE FORM SIN X 'Hi X-'B^ (X^) 

sir 



sin X := X (Sj + X^ (83 + X^ (S3 + X^ (S^ + X^ (S + :^^ Sq ))))) 



rel 



2.1 X 10" 



11 



Sj = +.99999 99999 79082 *) 
82 = -.16666 66660 92171 
S^ = +.00833 33307 30723 

54 = -.00019 84083 38222 

55 = +.00000 27524 01177 
Sg = -.00000 00238 68930 



(6.3.1) 



J 



(ii) APPROXIMATION OF THE FORM SIN X = X F^CX^^/Q^CX^) 
sin X := X(S, + X^ (83 + X^ (S, + X^ (S4 + 



X^ + S, 



)))) 



(6,3.2) 



X ,^1 = 1.0 X lo- 



ll 



This approximation is at least twice as accurate as 
(6.3.1) and can be computed faster, since 2 multipli- 
cations have been replaced by 1 division. Numerical 
stability is very good. 



*) cf. "NOTE" p. 42 
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(iii) APPROXIIATION OF THE FORM SIN X = X-P^CX^VQjCX^) 

2 q 
sin X := X (S, + X (S_ + H )> 

X2 + S4 + ^5 (6.3.3) 

X2 + Sg 

Two more multiplications have been replaced by one division, 
saving time but losing about 10% in accuracy. Numerical 
stability is not as good as in (6.3.2); therefore, a well- 
scaled fixed-point evaluation is mandatoryl 

We now proceed to the next degree of approximation which 

will yield about 8 additional bits: 

(iv) APPROXIMATION OF THE FORM SIN X = X P,(x2) 

o 

sin X := X (Sj + X^ (S2 + X^ (S3 + X^ (S^ + X^ (S^ + X^ (S^ + S^ X^)))))) 

-^ _14 (6.3.4) 

A= 6.2-10 *) 



(v) APPROXIMATION OF THE FORM SIN X = X- Pg(X^)/Qj(X'^) 

2 Sfi (6.3.5) 
sin X := X (Sj + X^ (83 + X^ (S3 + X^ (S4 + X (Sg + ))))) 



X" + Sy 



A ^ 2.3-10-^'^ *) 



(vi) APPROXIMATION OF THE FORM SIN X = X • P^(X^)/Q2(X^) 
sin X := X (Sj + X^ (S + X^ (S3 + — 5-: ))) (6.3.6) 



X-^ + S.; + 



X^ + S- 



\ - -1* V 

A = 2.3-10 *) 



*) Numerical stability: Good for (6.3.4) through (6.3.6) 
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(6.4) SUMMARY AKD NOTE 

In my opinion, the following approximations deserve prime con- 
sideration: 



FLOATING POINT: Almost full accuracy 
Full accuracy 



(6.2.2> 
(6.2.3) 



FIXED POINT: Full accuracy (6,2.5) 

The approximations in (6.3) take more time and are a little less 

accurate. 



NOTE: 



It may be convenient to introduce, in the beginning 
of a sin & cos subroutine, an auxiliary variable 

2 

(cf. ALGOS. 5.3.6), so that rational approximations 
not for sin X, but for sin ZJt_ are needed. The 
coefficients of -such approximations can easily be 
found from those for sin X or cos X. An example 
is given below: 



Substitute 



•rr 



for X in, say, (6.2.3) 



sm 



v-m 



V If 



(Si 



S2 



V ( <r^ . -5 



(vth/2)"^ + S3 + 



) 



V- + ^ + — 
2 2 



(v 1t/2)^ + S5 
— ) 



v^ + (7^ 



cr- _ TT 



2 TT ^ 



^= (— ) S 



rr— 2 4 



^3 = <ir. s^ 
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FINALE PRESTO 

I should have liked to discuss more functions in this report but 
serious time limitations have prevented me from doing so. 

We have some theoretical and numerical results for 



I e dt and J — — dt 
in particular for X-* OO (whence I e dt and I dt). 



X X 



eX_ 1 



A few simple functions, such as tanh X, In cosh X, 
and other functions for which either a well convergent power series or a 
well convergent continued fraction is known, can readily be treated, i.e., 
coefficients of suitable best-fit approximations can be obtained with out- 
codes as soon as we find time for punching and running. 

Some other functions, such as P (X) for real values of X, have 
been studied and more work will be needed to make them ready for machine 
computation of the coefficients for best-fit approximations. 

This semi-formal report has not been checked (for style, spelling 
and mathematical and numerical accuracy) as carefully as we would have 
checked a formal publication. Every comment and correction, and in 
particular reports on numerical checking of approximations given herein 
will be greatly appreciated. 
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